129 research outputs found

    Acoustic data-driven lexicon learning based on a greedy pronunciation selection framework

    Speech recognition systems for irregularly-spelled languages like English normally require hand-written pronunciations. In this paper, we describe a system for automatically obtaining pronunciations of words for which pronunciations are not available, but for which transcribed data exists. Our method integrates information from the letter sequence and from the acoustic evidence. The novel aspect we address is how to prune entries from such a lexicon, since, empirically, lexicons with too many entries tend to hurt ASR performance. Experiments on various ASR tasks show that, with the proposed framework, starting from an initial lexicon of several thousand words, we are able to learn a lexicon that performs close to a full expert lexicon in terms of WER on test data, and better than lexicons built using G2P alone or with a pruning criterion based on pronunciation probability.
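    The greedy selection idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's method: the scoring function below is a toy stand-in for the acoustic/letter-sequence objective, and all names and pronunciations are hypothetical.

    ```python
    # Hypothetical sketch of a greedy pronunciation-selection loop: repeatedly
    # add the candidate pronunciation whose inclusion most improves the
    # objective, and stop when no addition helps.

    def greedy_select(candidates, score):
        """Greedily grow a lexicon, one pronunciation entry at a time."""
        selected = set()
        while True:
            best_gain, best_cand = 0.0, None
            for cand in candidates - selected:
                gain = score(selected | {cand}) - score(selected)
                if gain > best_gain:
                    best_gain, best_cand = gain, cand
            if best_cand is None:
                return selected
            selected.add(best_cand)

    # Toy candidates: (word, pronunciation) pairs, purely illustrative.
    candidates = {("tomato", "T AH M EY T OW"),
                  ("tomato", "T AH M AA T OW"),
                  ("data", "D EY T AH")}

    # Toy objective: reward word coverage, penalise lexicon size (a crude
    # proxy for the empirical finding that over-large lexicons hurt ASR).
    def toy_score(lexicon):
        words_covered = len({w for w, _ in lexicon})
        return 10 * words_covered - len(lexicon)

    chosen = greedy_select(candidates, toy_score)
    print(sorted(chosen))  # one pronunciation kept per word
    ```

    With this objective the loop keeps exactly one pronunciation per word: adding a second pronunciation of "tomato" covers no new word but grows the lexicon, so its gain is negative and the loop stops.
    
    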

    Adapted Extended Baum-Welch transformations

    The discriminative technique for estimating parameters of Gaussian mixtures that is based on the Extended Baum-Welch (EBW) transformations has had a significant impact on the speech recognition community. In this paper we introduce a general definition of a family of EBW transformations that can be associated with a weighted sum of updated and initial models. We compute a gradient steepness measurement for a family of EBW transformations applied to functions of Gaussian mixtures and demonstrate the growth property of these transformations. We consider EBW transformations of discriminative functions in which the EBW-controlled parameters are adapted to a gradient steepness measurement or to the likelihood of the data given the model. We present experimental results showing that adapted EBW transformations can significantly speed up the estimation of Gaussian mixture parameters and give better decoding results.
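    For context, a commonly used form of the EBW update for a Gaussian mean and variance in discriminative (e.g. MMI) training is shown below; the notation is the conventional one and is an assumption here, not taken from this paper. The smoothing constant \(D_m\) is the "controlled parameter" that adaptive schemes like the one described above tune.

    ```latex
    \hat{\mu}_m = \frac{\theta_m^{\mathrm{num}}(x) - \theta_m^{\mathrm{den}}(x) + D_m \mu_m}
                       {\gamma_m^{\mathrm{num}} - \gamma_m^{\mathrm{den}} + D_m}

    \hat{\sigma}_m^2 = \frac{\theta_m^{\mathrm{num}}(x^2) - \theta_m^{\mathrm{den}}(x^2)
                             + D_m\,(\sigma_m^2 + \mu_m^2)}
                            {\gamma_m^{\mathrm{num}} - \gamma_m^{\mathrm{den}} + D_m}
                       - \hat{\mu}_m^2
    ```

    Here \(\gamma_m\) are occupation counts and \(\theta_m(x)\), \(\theta_m(x^2)\) first- and second-order statistics, accumulated over numerator (reference) and denominator (competing-hypothesis) lattices; larger \(D_m\) keeps the update closer to the initial model.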

    Characterising the mechanical properties of soft solids through acoustics and rheology, exemplified by anhydrous milk fat

    Foods vary in their elastic properties over a wide range of behaviours. In the case of mastication, textures vary from hard solid through brittle (chocolate bar) and crispy/crunchy (biscuits) to viscous and extensional flow (syrup) and finally very low viscosity fluid (water). Here we deploy an elastic description of soft solids that embraces all these behaviours to quantify the elastic behaviour of food, in particular through the use of sound. We illustrate the use of this mathematical description in the quantitative characterisation of the elastic and flow properties of food through orthodox measurement techniques and novel ultrasound methods. Measurement is complicated by human sensory capabilities, which span the entire range from solid to fluid to gas in an integrated manner during the appreciation of food. We use acoustic and rheological measurement techniques for the determination of the mechanical properties of soft solids, comparing oscillatory rheometry with acoustic parameters, as exemplified by acoustic and oscillatory rheometry measurements in crystallising anhydrous milk fat (AMF). We conclude that acoustic and rheological measurements complement each other, with acoustic techniques offering the possibility of inline, in-process determination of mechanical and flow properties such as viscosity, rigidity, compressibility and bulk modulus.
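    One simple relation behind such acoustic determinations of elastic moduli: in a fluid-like medium the measured longitudinal sound speed and the density give the bulk modulus directly. The sketch below uses made-up numbers, not data from this work, and ignores the shear contribution a true soft solid would add (there the measured quantity is the longitudinal modulus \(M = K + 4G/3\)).

    ```python
    # Illustrative link between an acoustic measurement and an elastic
    # modulus: bulk modulus K = rho * c**2 for a fluid-like medium.
    # The values below are invented for illustration only.

    rho = 900.0   # density, kg/m^3
    c = 1500.0    # measured longitudinal sound speed, m/s

    K = rho * c**2   # bulk modulus, Pa
    print(f"bulk modulus: {K / 1e9} GPa")
    ```

    Inverting the same relation is what makes inline monitoring attractive: tracking the sound speed during crystallisation tracks the stiffening of the medium without touching the sample.
    
    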

    GPU-accelerated Guided Source Separation for Meeting Transcription

    Guided source separation (GSS) is a target-speaker extraction method that relies on pre-computed speaker activities and blind source separation to perform front-end enhancement of overlapped speech signals. It was first proposed during the CHiME-5 challenge and provided significant improvements over the delay-and-sum beamforming baseline. Despite its strengths, however, the method has seen limited adoption for meeting transcription benchmarks, primarily due to its high computation time. In this paper, we describe our improved implementation of GSS that leverages the power of modern GPU-based pipelines, including batched processing of frequencies and segments, to provide a 300x speed-up over CPU-based inference. The improved inference time allows us to perform detailed ablation studies over several parameters of the GSS algorithm, such as context duration, number of channels, and noise class, to name a few. We provide end-to-end reproducible pipelines for speaker-attributed transcription of popular meeting benchmarks: LibriCSS, AMI, and AliMeeting. Our code and recipes are publicly available: https://github.com/desh2608/gss (Comment: 7 pages, 4 figures.)
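    The "batched processing of frequencies" idea can be illustrated with a small NumPy sketch: rather than looping over frequency bins, all per-frequency operations are stacked into one batched tensor contraction, which is what maps well onto a GPU. Shapes and the covariance computation here are illustrative assumptions, not the actual implementation.

    ```python
    import numpy as np

    # STFT-like tensor: F frequency bins, C channels, T frames (synthetic).
    rng = np.random.default_rng(0)
    F, C, T = 257, 8, 100
    X = rng.standard_normal((F, C, T)) + 1j * rng.standard_normal((F, C, T))

    # Per-frequency spatial covariance, the loop way (one matrix per bin):
    R_loop = np.stack([X[f] @ X[f].conj().T / T for f in range(F)])

    # The batched way: a single einsum over all frequency bins at once.
    R_batch = np.einsum("fct,fdt->fcd", X, X.conj()) / T

    print(np.allclose(R_loop, R_batch))  # both give the same result
    ```

    On a CPU with NumPy the two are comparable, but on a GPU the batched form replaces F kernel launches with one, which is the kind of restructuring behind the reported speed-up.
    
    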

    Probing the Information Encoded in X-vectors

    Deep neural network based speaker embeddings, such as x-vectors, have been shown to perform well in text-independent speaker recognition/verification tasks. In this paper, we use simple classifiers to investigate the contents encoded by x-vector embeddings. We probe these embeddings for information related to the speaker, channel, transcription (sentence, words, phones), and meta information about the utterance (duration and augmentation type), and compare these with the information encoded by i-vectors across a varying number of dimensions. We also study the effect of data augmentation during extractor training on the information captured by x-vectors. Experiments on the RedDots data set show that x-vectors capture spoken content and channel-related information, while performing well on speaker verification tasks.
    Comment: Accepted at IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 201
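    The probing setup can be sketched in a few lines: fit a simple classifier on frozen embeddings and check whether some property (say, augmentation type) is recoverable from them. The sketch below uses a nearest-centroid probe on synthetic embeddings; it is an illustration of the idea, not the paper's classifiers or data.

    ```python
    import numpy as np

    # Synthetic stand-ins for embeddings: two "property" classes, each a
    # cluster around its own centre in embedding space.
    rng = np.random.default_rng(1)
    dim, n_per_class = 32, 200
    centers = rng.standard_normal((2, dim))
    emb = np.concatenate([c + 0.3 * rng.standard_normal((n_per_class, dim))
                          for c in centers])
    labels = np.repeat([0, 1], n_per_class)

    # Simple probe: class centroids from a training split, nearest-centroid
    # prediction on the held-out half.
    train = np.arange(len(emb)) % 2 == 0
    centroids = np.stack([emb[train & (labels == k)].mean(axis=0)
                          for k in (0, 1)])
    dists = np.linalg.norm(emb[~train, None, :] - centroids, axis=2)
    pred = np.argmin(dists, axis=1)
    acc = (pred == labels[~train]).mean()
    print(f"probe accuracy: {acc:.2f}")
    ```

    The logic of the experiment is in the readout: if even a classifier this simple predicts the property well above chance, the embedding must encode it.
    
    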